Introduction to Bioconductor

Bioconductor (BioC) is an open source, community driven software project which provides a framework of tools and databases for the analysis of biological data in R.

  • Started in 2001 with Robert Gentleman.
  • Gained popularity through microarray analysis packages (marray/limma).
  • A core of reviewed and versioned R packages.
  • Two major releases every year (R has one major release).
  • Great support for high-throughput sequencing.

Bioconductor Goals.

To provide:

  • access to analytical and graphical methods for biological data.
  • methods to integrate metadata from biological data repositories.
  • a software platform enabling the rapid development of new tools.
  • a system for reproducible research and analysis documentation.
  • training and workflows.

Bioconductor website


Bioconductor packages

The current release is Bioconductor 3.6, which includes 2712 packages.

All packages have been:

  • Reviewed.
  • Tested and evaluated automatically.
  • Actively maintained and updated.


Package review.

Before being accepted into Bioconductor, all new packages are reviewed to ensure they meet the Bioconductor guidelines.

The review includes automatic testing of packages:

  • Testing manual and all examples.
  • Checking code integrity.

It also includes an open review on the Bioconductor GitHub site.


Bioconductor packages

Review ensures

  • All packages can be built in the latest version of R.
  • Examples in the reference manual can be evaluated.
  • All code in the vignette can be run.



A Bioconductor package

Example package: basecallQC



Installing a Bioconductor package

Installing a Bioconductor package is quite straightforward. Every Bioconductor package page lists the R command needed to install it, which we can simply copy and paste.

Here we use the source function to load a script containing functions for installing Bioconductor packages. We then use the newly acquired biocLite function to install the package of choice, in a manner similar to install.packages.

source("https://bioconductor.org/biocLite.R")
biocLite("basecallQC")
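From Bioconductor 3.8 onwards, the biocLite script was replaced by the BiocManager package from CRAN. The sketch below shows roughly how the same installation could look on a newer setup. The helper name install_bioc_package is hypothetical and is not part of BiocManager; it is only there to avoid reinstalling a package that is already present.

```r
# Hypothetical helper (not part of any package): install a Bioconductor
# package only if it is not already installed.
install_bioc_package <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")  # BiocManager itself lives on CRAN
  }
  BiocManager::install(pkg)
  invisible(TRUE)
}

# Usage (commented out because it downloads packages):
# install_bioc_package("basecallQC")
```

Which approach applies depends on the Bioconductor release in use, so it is worth checking the installation instructions on the package page.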

Bioconductor package dependencies.

All dependencies and their required versions are resolved for us. We must be careful, however, to check the version of Bioconductor we are using.

biocVersion()
## [1] '3.5'

If we wish to update to the latest Bioconductor release, we can run biocLite("BiocUpgrade").
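Because each Bioconductor release is tied to a particular R version, it can help to check both together. This is a minimal sketch and assumes that, on newer installations, the BiocManager package may be available. The BiocManager call is guarded so the code still runs without it.

```r
# The R version is always available from base R.
r_version <- getRversion()
print(r_version)

# BiocManager::version() reports the Bioconductor release in use.
# Guarded because BiocManager may not be installed.
if (requireNamespace("BiocManager", quietly = TRUE)) {
  print(BiocManager::version())
  # BiocManager::valid() can then check whether the installed packages
  # match that release (not run here, as it queries the repositories).
}
```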


Reference Manual

.pull-left[

All packages will have a reference manual containing the help pages for every function.

Importantly, this will include:

  • Details of functions’ inputs and outputs.
  • Working examples.

]

.pull-right[


]
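The same help pages can be reached from within R. This sketch uses the stats package, which ships with R, as a stand-in; the same calls should work with any installed Bioconductor package by swapping in its name.

```r
# Index of all help pages in a package (print it to view the listing).
pkg_help <- help(package = "stats")

# Help page for a single function: inputs, outputs and examples.
fn_help <- help("median", package = "stats")

# Run the working examples from a function's help page.
example("median", package = "stats", ask = FALSE)
```

Interactively, `?median` is the usual shortcut for the second call.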


Vignette

.pull-left[

All packages will also include at least one vignette.

These vignettes detail a typical usage of the package with working examples included.

]

.pull-right[


]
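Vignettes can also be listed and opened from the R console. This sketch uses the grid package, which ships with R, as a stand-in. The commented lines show how the same calls might look for basecallQC, assuming it is installed.

```r
# With only a package argument, vignette() lists that package's vignettes.
grid_vignettes <- vignette(package = "grid")
print(grid_vignettes$results[, c("Item", "Title"), drop = FALSE])

# For an installed Bioconductor package (not run here):
# browseVignettes("basecallQC")  # opens an index of vignettes in the browser
```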


Genomics data in Bioconductor

Bioconductor packages cover a wide range of biological data types.

In this course we are focusing on high-throughput sequencing, so we will concentrate on the main packages for this.

This includes methods for handling common genomics data types.

  • FASTA and FASTQ
  • BED, BED6 and narrowPeak/broadPeak
  • GFF
  • SAM and BAM

FASTA in Bioconductor.

Genomic sequences stored as FASTA files are handled using the Biostrings package.
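As a rough illustration, the sketch below writes a tiny made-up FASTA file and reads it back with Biostrings. The sequences and file are invented for the example, and the Biostrings part only runs if the package is installed.

```r
# Write a small two-record FASTA file so the example is self-contained.
fasta_file <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTACGTAA", ">seq2", "GGGCCCTT"), fasta_file)

if (requireNamespace("Biostrings", quietly = TRUE)) {
  suppressPackageStartupMessages(library(Biostrings))
  seqs <- readDNAStringSet(fasta_file)  # a DNAStringSet, one entry per record
  print(seqs)
  print(width(seqs))                    # length of each sequence
  print(reverseComplement(seqs))        # reverse complement of each sequence
}
```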



BED/BED6 in Bioconductor.

Genomic intervals stored as BED files are handled using the rtracklayer and GenomicRanges packages.
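A minimal sketch of this, using a made-up two-line BED6 file: rtracklayer imports the intervals as a GRanges object. The peak names and coordinates are invented, and the import only runs if rtracklayer is installed.

```r
# Write a small BED6 file: chrom, start, end, name, score, strand.
bed_file <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200\tpeak1\t10\t+",
             "chr1\t500\t650\tpeak2\t25\t-"), bed_file)

if (requireNamespace("rtracklayer", quietly = TRUE)) {
  # Attaching rtracklayer also attaches GenomicRanges.
  suppressPackageStartupMessages(library(rtracklayer))
  peaks <- import(bed_file, format = "BED")  # returns a GRanges object
  print(peaks)
  # BED starts are 0-based while GRanges starts are 1-based,
  # so the starts here are 101 and 501.
  print(start(peaks))
  print(width(peaks))
}
```

The shift from 0-based BED coordinates to 1-based GRanges coordinates is handled by import, which is one reason to prefer it over reading the file as a plain table.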



Wig and bigWig in Bioconductor.

Genomic scores stored as wig or bigWig files are handled using the rtracklayer and GenomicRanges packages.
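The sketch below creates a toy bigWig file and reads it back, including a read restricted to a region. The scores and the chromosome length of 100 are invented for the example. It assumes rtracklayer is installed and that bigWig support is available on the platform, which may not be the case for older versions on Windows.

```r
if (requireNamespace("rtracklayer", quietly = TRUE)) {
  suppressPackageStartupMessages(library(rtracklayer))

  # Toy signal: two adjacent 10 bp windows with different scores.
  scores <- GRanges("chr1", IRanges(start = c(1, 11), width = 10),
                    score = c(1.5, 3))
  seqlengths(scores) <- c(chr1 = 100)  # bigWig export needs chromosome lengths

  bw_file <- tempfile(fileext = ".bw")
  export(scores, bw_file, format = "BigWig")

  # Import the whole file as a GRanges with a score column.
  signal <- import(bw_file, format = "BigWig")
  print(signal)

  # Import only the scores overlapping a region of interest.
  region <- GRanges("chr1", IRanges(start = 1, end = 10))
  print(import(bw_file, format = "BigWig", which = region))
}
```

Restricting the import with `which` is useful for genome-wide files, where loading every score at once can be slow.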


FastQ

FASTQ files containing sequence reads and their base quality scores are handled using the ShortRead package.
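As an illustration, this sketch writes two made-up reads to a FASTQ file and reads them back. The read names, sequences and qualities are invented, and the ShortRead part only runs if the package is installed.

```r
# Each FASTQ record has four lines: id, sequence, separator, qualities.
fastq_file <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTACGT", "+", "IIIIHHHH",
             "@read2", "GGCCTTAA", "+", "IIII####"), fastq_file)

if (requireNamespace("ShortRead", quietly = TRUE)) {
  suppressPackageStartupMessages(library(ShortRead))
  reads <- readFastq(fastq_file)  # a ShortReadQ object
  print(reads)
  print(sread(reads))             # the read sequences
  print(quality(reads))           # the encoded base qualities
  print(id(reads))                # the read identifiers
}
```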


Reference Data in Bioconductor.

As well as software packages, Bioconductor maintains a number of annotation packages.

These include microarray annotation, gene to ID mappings, genes’ functional annotation, genome sequence information and gene/transcript models.


Gene annotation.

Gene annotation for model organisms is contained within the org.db packages.

Format is org.<species>.<ID type>.db

Homo sapiens annotation with Entrez Gene IDs: org.Hs.eg.db
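A short sketch of how such a package might be queried: the name is assembled from its parts to mirror the format above, and a gene symbol is mapped to its Entrez ID and gene name. TP53 is just an illustrative choice, and the query only runs if org.Hs.eg.db is installed.

```r
# Build the package name from its parts: org.<species>.<ID type>.db
org_pkg <- paste("org", "Hs", "eg", "db", sep = ".")
print(org_pkg)

gene_symbol <- "TP53"  # illustrative gene symbol

if (requireNamespace(org_pkg, quietly = TRUE)) {
  suppressPackageStartupMessages(library(org_pkg, character.only = TRUE))
  print(columns(org.Hs.eg.db))  # the kinds of annotation available
  # Map the symbol to its Entrez Gene ID and full gene name.
  tp53 <- AnnotationDbi::select(org.Hs.eg.db,
                                keys = gene_symbol,
                                keytype = "SYMBOL",
                                columns = c("ENTREZID", "GENENAME"))
  print(tp53)
}
```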



Genome Sequence

Genomic sequence information is held within the BSgenome packages.

Format is BSgenome.<species>.<source>.<major version>

Homo sapiens genome sequence from UCSC’s version hg19: BSgenome.Hsapiens.UCSC.hg19
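The sketch below shows one way sequence might be pulled out of a BSgenome package. The region on chr1 is an arbitrary choice for illustration. The package is a large download, so the extraction only runs if it is already installed.

```r
# Build the package name: BSgenome.<species>.<source>.<major version>
bs_pkg <- paste("BSgenome", "Hsapiens", "UCSC", "hg19", sep = ".")
print(bs_pkg)

if (requireNamespace(bs_pkg, quietly = TRUE)) {
  suppressPackageStartupMessages(library(bs_pkg, character.only = TRUE))
  genome <- BSgenome.Hsapiens.UCSC.hg19
  print(head(seqlengths(genome)))  # chromosome names and lengths

  # Extract the sequence for an arbitrary 50 bp region.
  region <- GRanges("chr1", IRanges(start = 1000001, end = 1000050))
  print(getSeq(genome, region))    # returns a DNAStringSet
}
```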



Gene Models

Gene models are held in the TxDb packages.

Format is TxDb.<species>.<source>.<major version>.<table>

Homo sapiens gene build from UCSC’s version hg19 known gene table: TxDb.Hsapiens.UCSC.hg19.knownGene
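A minimal sketch of retrieving gene models from a TxDb package as GRanges objects. It only runs if the package is installed. Attaching it also attaches GenomicFeatures, which provides the accessor functions used here.

```r
# Build the package name: TxDb.<species>.<source>.<major version>.<table>
txdb_pkg <- paste("TxDb", "Hsapiens", "UCSC", "hg19", "knownGene", sep = ".")
print(txdb_pkg)

if (requireNamespace(txdb_pkg, quietly = TRUE)) {
  suppressPackageStartupMessages(library(txdb_pkg, character.only = TRUE))
  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

  hg19_genes <- genes(txdb)               # one range per gene
  print(hg19_genes)

  hg19_transcripts <- transcripts(txdb)   # one range per transcript
  print(hg19_transcripts)

  exons_by_gene <- exonsBy(txdb, by = "gene")  # exons grouped by gene
  print(exons_by_gene[1])
}
```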



Time for an exercise.

Link_to_exercises

Link_to_answers